Abstract

Background: The emerging multi-layers of genomic data have provided unprecedented opportunities for cancer
research, especially for the association study between gene expressions and other types of genomic features. No
previous approach, however, provides an adequate statistical framework for, or a global analysis of, the relative
impact of different genomic feature layers on gene expression phenotypes.

Methods: We propose an integrative statistical framework based on sparse regression to model the impact of
multi-layered genomic features on gene expression traits. The proposed approach can be regarded as an
integrative expression quantitative trait loci (eQTL) approach in which not only genetic variations such as SNPs or
copy number variations but also other features at both the genomic and epigenomic levels are used to explain the
expression of genes. To highlight the validity of the proposed approach, the TCGA ovarian cancer dataset was
analysed as a pilot task.

Results: The analysis shows that our integrative approach has consistently superior power in predicting gene 
expression levels compared to that from each single data type-based analysis. Moreover, the proposed method has 
the advantage of producing a substantially reduced number of spurious associations. We provide an interesting
characterization of genes in terms of their genomic association patterns. Important genomic features reported in
previous ovarian cancer research are successfully identified as major hubs in the resulting association network 
between heterogeneous types of genomic features and genes. 

Conclusions: In this paper, we model the gene expression phenotypes with respect to multiple different types of 
genomic data in an integrative framework. Our analysis reveals the global view on the relative contribution of 
different genomic feature types to gene expression phenotypes in ovarian cancer. 



Introduction 

Cancer is a complex disease mainly characterized by
uncontrolled proliferation and cell growth. Genes regulating
differentiation and cell growth must be altered for a
normal cell to transform into a cancer cell [1]. Expression
of oncogenes or tumor suppressor genes promotes the
malignant phenotype of cancer cells or inhibits cell division,
development, or survival of cancer cells, respectively
[1]. In many respects, a general survey of gene expression
phenotypes serves as a proxy for the nature and breadth of
phenotypic variation in human cancer [2,3]. In addition,
gene expression is strongly associated with other types of
genomic data at the genomic or epigenomic level [4].

In order to identify the relation between gene expression
and other types of genomic data, there have been many
attempts at integrative analyses between them. The
expression quantitative trait loci (eQTL) approach, which
integrates large-scale genotype data and expression
profiles, has been established and offers a new perspective
in biomedicine [5-8]. This approach treats gene expression
profiles as quantitative traits or intermediate phenotypes
and searches for genomic variation that can explain the
variance of the molecular traits [9,10]. In addition to SNP
data at the genome level, many integrative analyses between
copy number variation and gene expression have been
reported to identify genes that are associated with gene
dosage [11-14]. In terms of epigenetic regulation, DNA
methylation or histone modification can serve to regulate
gene expression in cancer [15-18]. Furthermore, as one of
the important regulators of gene expression, miRNA
expression can be integrated with gene expression to identify
the selective degradation or selective inhibition of
translation [19-21].

Despite these efforts, analyses based on only a pair of genomic
data types reveal a limited view of the genomic mechanisms
underlying cancer. Recently, the emerging
multi-layers of genomic data have provided unprecedented 
opportunities to identify the global view of relations 
between multi-layers of genomic data. The Cancer Gen- 
ome Atlas (TCGA) is a large-scale collaborative initiative 
to improve understanding of cancer using multi-layers of 
genomic data. The TCGA research network recently pub- 
lished many notable papers on several cancers concerning 
an interim analysis of DNA sequencing, copy number, 
DNA methylation, miRNA, and gene expression data 
[22-26]. The International Cancer Genome Consortium 
(ICGC) is another multidisciplinary collaborative effort to 
characterize a comprehensive description of genomic, tran- 
scriptomic and epigenomic abnormalities in 50 different 
cancer types [27]. While the TCGA and ICGC open many 
opportunities to deepen the knowledge of the molecular 
basis of cancer [27-29], it is particularly important to access 
different levels of genomic data at hand for providing an 
enhanced global view on interplays between them. 

The emerging large-scale multi-layered genomic datasets
demand novel computational methods. There have
been several integrative approaches for multi-layers of
genomic data. For example, Chari et al. used the integra- 
tive analysis approach with multi-dimensional genomics 
data, enabling the understanding of mechanisms that dis- 
turb regulatory/signalling cascades and downstream effects 
[30]. Another relevant method, CNAmet, is an R package 
for integrative analysis of high-throughput copy number, 
DNA methylation, and gene expression data to identify 
genes that are amplified, hypomethylated and upregulated, 
or deleted, hypermethylated and downregulated [31]. In 
addition, other types of integrative methodological frame- 
work have been recently proposed to identify multi- 
dimensional regulatory modules from different levels of 
genomic data [32] or to combine different levels of geno- 
mic data for cancer clinical outcome prediction in the 
multiple-scale and the synergistic manner [33,34], which
highlights the importance of integrative approaches utilizing
multi-omics data systematically. However, to the best
of our knowledge, there has not been any comprehensive
analysis of the relative contribution of different genomic
data to gene expression phenotypes, nor an adequate
statistical approach to address this issue of elucidating gene
expression phenotypes with more than two types of genomic
data at hand. As different levels of genomic data, such
as copy number, SNP, methylation, or miRNA, might
affect gene regulation in either a specific or a synergistic
fashion, an integrative framework that incorporates all
these different genomic features as potential regulators of 
gene expression will lead us to an enhanced global view 
on interplays between them (Figure 1). Simple correlation- 
based association tests will typically result in a large num- 
ber of associations redundantly appearing across different 
types of genomic features. This makes it difficult to accu- 
rately measure the relative impact of each genomic feature 
type to gene expression traits. In this paper, we propose a 
sparse regression based framework for elucidating expres- 
sion phenotype using different layers of genomic data as 
covariates. 

In order to demonstrate the validity of the synergistic 
impact of multiple genomic data on gene expression 
phenotypes, ovarian cancer data from TCGA was used 
as a pilot task. Ovarian cancer (OV) is one of the most
common gynecological malignancies, and is the fifth leading
cause of cancer mortality in women in the United
States [35]. Understanding the molecular pathogenesis
and underlying biology in ovarian cancer through the 
global view on interplays between different levels of 
genomic data is expected to provide guidance for 
improved prognostic indicators and effective therapies. 

Through this pilot task, we validate that the predictive 
power of the proposed integrative model is consistently 
superior compared to those of the individual approaches 
that employ each type of genomic data separately. More- 
over, the proposed framework is substantially more effec- 
tive in reducing spurious associations between gene 
expression phenotypes and other genomic features. Com- 
parison of the resulting association networks from these 
two approaches also supports the validity of the proposed 
framework. Based on this validation, we provide more 
focused analyses on the inferred association network to 
highlight the biological significance of our findings. 

Materials and methods 

Data 

Datasets in ovarian cancer were retrieved from the Cancer
Genome Atlas (TCGA) data portal (http://tcga-data.nci.nih.gov)
(Table 1). Beta-values from the Infinium HumanMethylation27
BeadChip, ranging from 0 to 1, were used as the
DNA methylation data. After obtaining beta-values of
methylation probes, the final matrix of DNA methylation 
Figure 1 The schematic overview of the integrative framework. Different levels of genomic data such as copy number, methylation, or
miRNA might affect gene regulation through either specific or synergistic fashion.



was constructed by segmenting the 27,578 methylation 
probes into 9,219 gene features after removing zero- 
values across all the 381 samples. Level 3 data of gene 
expression and miRNA expression from TCGA were 
used as an expression signal of a gene or miRNA, con- 
taining 12,042 genes and 799 miRNAs, respectively. Copy 
number alteration (CNA) data were obtained from the cBio
Cancer Genomics Portal in order to use the results of
GISTIC, which attempts to identify significantly altered
regions of amplification or deletion across sets of patients
[36]. The CNA data contain 54 significant cytoband regions
with categorical values of -2, -1, 0, 1, or 2. As we use
methylation, CNA, and miRNA data as covariates in our 
predictive statistical framework, we denote these three 
types of data as different genomic feature types. 

Data preprocessing 

To minimize the effect of heterogeneity in feature-wise 
distributions, type-specific preprocessing schemes are 
applied to each genomic feature set. Methylation data, which
have the greatest number of features, are non-specifically
filtered by variance such that the methylation features in the
lowest 25% by variance are removed from the feature set. We
use all 799 microRNA features without further filtering,
not only because the number of miRNA features is relatively
small in comparison with that of the methylation
data, but also because the overall variance of the microRNA
expression data was relatively high. Copy number alteration
data have a unique characteristic in that a copy
number alteration event typically occurs across a long range
of loci on a chromosome rather than at a single locus
or short region. After exploring several alternative feature
representations, for example those based on
either probes or genes, we chose to use the cytoband-based
copy number alteration features output by GISTIC,
which showed the best performance in our analyses below.

We examine the impact of these resulting genomic fea- 
tures on each of the 12,042 gene expression traits. The 
feature values of all genomic data are finally normalized 
to have a zero mean and standard deviation of one across 
samples so that the relative impact of different genomic 
features on expression traits can be properly represented. 
Under this setting, let $y_k$ denote an $N$-dimensional vector
of expression traits of gene $k$ in $N = 381$ individuals, and


Table 1 Data description

Cancer type  Data type        Platform                                            # Features (d) after preprocessing
OV           CNA              Agilent SurePrint G3 Human CGH Microarray Kit 1x1M  54
             Methylation      Infinium HumanMethylation27 BeadChip                6,913
             miRNA            Agilent Human miRNA Microarray Rel12.0              799
             Gene expression  Affymetrix HT Human Genome U133 Array Plate Set     12,042

let $X^{(m)}$ denote an $N \times d^{(m)}$ feature matrix for $N$ samples
and $d^{(m)}$ genomic features, where $d^{(m)}$ = 54, 6,913,
and 799 for $m \in$ {CNA, methylation, miRNA}, respectively.
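
The preprocessing steps described above can be sketched as follows. This is a minimal illustration in pure Python, not the authors' pipeline: the function names, the toy beta-values, the column-wise storage (one list per feature), and the use of the population variance are all assumptions made for the example.

```python
from statistics import mean, pstdev, pvariance

def filter_low_variance(columns, drop_fraction=0.25):
    """Drop the features whose variance falls in the lowest `drop_fraction`.

    `columns` holds one list of per-sample values for each feature.
    Returns the kept columns and their original indices.
    """
    ranked = sorted(range(len(columns)), key=lambda j: pvariance(columns[j]))
    n_drop = int(len(columns) * drop_fraction)
    keep = sorted(ranked[n_drop:])
    return [columns[j] for j in keep], keep

def standardize(column):
    """Scale one feature to zero mean and unit standard deviation across samples."""
    mu, sd = mean(column), pstdev(column)
    if sd == 0:
        return [0.0 for _ in column]  # a constant feature carries no signal
    return [(v - mu) / sd for v in column]

# Hypothetical methylation block: 4 features, each measured on 5 samples.
methylation = [
    [0.10, 0.11, 0.10, 0.12, 0.11],  # nearly constant, so it is filtered out
    [0.20, 0.80, 0.50, 0.90, 0.10],
    [0.30, 0.35, 0.60, 0.70, 0.40],
    [0.05, 0.95, 0.05, 0.95, 0.50],
]
kept_columns, kept_index = filter_low_variance(methylation)
X_methylation = [standardize(col) for col in kept_columns]
```

In this sketch only the methylation block passes through `filter_low_variance`, while `standardize` would be applied to every feature of every data type, mirroring the order of operations in the text.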

Simultaneous feature selection and regression by a 
sparse regression 

We first describe the baseline framework for modeling
the effect of a single type of genomic features on gene
expression traits. Typically, the association between
genomic features and gene expression traits has been
analyzed by a simple correlation test, either under a parametric
assumption or in a non-parametric way. As this
relies on a pair-wise test between each single feature and 
each single gene, it is not capable of modeling the syner- 
gistic effect from multiple features to an expression trait. 
Moreover, the simple correlation measure tends to pro- 
duce a large number of indirect genomic associations 
and does not reflect the possible interplays between 
potential regulators. The multiple testing issues caused 
by the huge number of pairwise tests also discourage the 
use of such a test for this type of integrative analysis. 

Instead, we employ a sparse regression framework
that has recently emerged as a powerful tool for
detecting associations in a high-dimensional space. Under
this model, the impact of $J$ possible features $x_{1i}, \ldots, x_{Ji}$ on a
trait value $y_i$ is modeled as a multivariate linear regression
as follows:

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_J x_{Ji} + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)$$

where $i$ denotes the index for different samples. The $L_1$-penalized
regression framework called lasso [37] solves the
following optimization problem to detect a relatively small
number of effective covariates affecting the trait:

$$\min_{\beta} \sum_i \bigl(y_i - (\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_J x_{Ji})\bigr)^2 + \lambda \sum_j |\beta_j|$$

The second term, the $L_1$-penalty on $\beta = (\beta_1, \ldots, \beta_J)$, induces
a sparse solution by reducing the number of non-zero coefficients
in $\beta$. The regularization parameter $\lambda$ controlling the
degree of sparsity is determined by cross-validation. Therefore,
the solution given by lasso yields a small set of features
in association with the trait, and the association
strength of each effective feature $j$ is given by $\beta_j$. We adopt a
Screen and Clean procedure [38] on top of lasso as our
baseline statistical framework to allow further filtering of
detected features based on $p$-values. We set the threshold
for the $p$-values to 0.05 throughout our analysis.
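
A minimal sketch of the lasso step, assuming the features and the trait are already centred so that the intercept can be dropped. It uses plain coordinate descent in pure Python rather than any particular library; the function names and the toy data are illustrative and not taken from the paper. Cross-validation of $\lambda$ and the Screen and Clean filtering are not included.

```python
def soft_threshold(a, t):
    """Shrink `a` towards zero by `t`; values within [-t, t] become exactly 0."""
    if a > t:
        return a - t
    if a < -t:
        return a + t
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise sum_i (y_i - sum_j b_j x_ij)^2 + lam * sum_j |b_j|.

    X is a list of sample rows with centred columns; y is centred.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta for beta = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Inner product of feature j with the residual that excludes
            # feature j's own current contribution.
            rho = sum(X[i][j] * residual[i] for i in range(n)) + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam / 2.0) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
                beta[j] = new
    return beta

# Toy data: three mutually orthogonal, centred features on six samples.
x0 = [1, 1, 1, -1, -1, -1]
x1 = [1, -1, 0, 1, -1, 0]
x2 = [1, 1, -2, 1, 1, -2]
X = [[x0[i], x1[i], x2[i]] for i in range(6)]
y = [2.0 * x0[i] + 0.1 * x1[i] for i in range(6)]  # strong x0 effect, weak x1 effect

beta = lasso_coordinate_descent(X, y, lam=1.0)
```

With this toy design the weak effect of the second feature falls below the penalty and is set exactly to zero, which is the sparsity-inducing behaviour the text relies on.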

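We extend this baseline to an integrative model that
deals with $M$ different types of data as covariates, assuming
the following formulation:

$$y_i = \beta^{(1)} \cdot x^{(1)}_i + \beta^{(2)} \cdot x^{(2)}_i + \cdots + \beta^{(M)} \cdot x^{(M)}_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)$$

where $\beta^{(m)}, x^{(m)}_i \in \mathbb{R}^{d^{(m)}}$ for $m = 1, \ldots, M$.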



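Note that we excluded the intercept $\beta_0$ because we
already centered the data matrix to have zero mean
column-wise. Through this formulation, a trait may be
impacted by either one type of genomic features, or by
multiple types of genomic features synergistically. Since
the selected TCGA dataset provides three different types
of genomic data as genomic features, the final optimization
problem we solve is:

$$\min_{\beta} \sum_i \bigl(y_i - (\beta^{\mathrm{CNA}} \cdot x^{\mathrm{CNA}}_i + \beta^{\mathrm{methylation}} \cdot x^{\mathrm{methylation}}_i + \beta^{\mathrm{miRNA}} \cdot x^{\mathrm{miRNA}}_i)\bigr)^2 + \lambda \bigl(\|\beta^{\mathrm{CNA}}\|_1 + \|\beta^{\mathrm{methylation}}\|_1 + \|\beta^{\mathrm{miRNA}}\|_1\bigr)$$

We denote the sparse solution of the above integrative
setting by $\beta^{(m)}_{\mathrm{integrative}}$.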
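
Because the sum of the per-type $L_1$ norms equals the $L_1$ norm of the concatenated coefficient vector, the integrative problem can be treated as an ordinary lasso on column-wise stacked feature blocks. The sketch below illustrates only that bookkeeping (stacking the blocks and splitting a fitted coefficient vector back by type). The helper names and numbers are hypothetical, and the coefficient vector is made up rather than fitted.

```python
def stack_blocks(blocks):
    """Concatenate per-type feature matrices (lists of sample rows) column-wise.

    Returns the stacked matrix and, for each type, its (start, end) column span.
    """
    names = list(blocks)
    n_samples = len(blocks[names[0]])
    stacked = [[] for _ in range(n_samples)]
    spans, start = {}, 0
    for name in names:
        rows = blocks[name]
        if len(rows) != n_samples:
            raise ValueError("all blocks must share the same samples")
        width = len(rows[0])
        for i in range(n_samples):
            stacked[i].extend(rows[i])
        spans[name] = (start, start + width)
        start += width
    return stacked, spans

def split_coefficients(beta, spans):
    """Cut one long coefficient vector back into per-type vectors."""
    return {name: beta[a:b] for name, (a, b) in spans.items()}

def l1_penalty(beta_by_type, lam):
    """lam * (||beta_CNA||_1 + ||beta_methylation||_1 + ||beta_miRNA||_1)."""
    return lam * sum(abs(b) for coefs in beta_by_type.values() for b in coefs)

# Hypothetical standardized blocks for 3 samples.
blocks = {
    "CNA": [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]],
    "methylation": [[0.5, 0.2, 0.1], [-0.5, 0.3, 0.0], [0.0, -0.5, -0.1]],
    "miRNA": [[0.3], [-0.1], [-0.2]],
}
stacked, spans = stack_blocks(blocks)

# A made-up sparse coefficient vector over the 6 stacked columns.
beta = [0.0, 0.4, 0.0, 0.0, -0.2, 0.0]
beta_by_type = split_coefficients(beta, spans)
penalty = l1_penalty(beta_by_type, lam=2.0)
```

Any lasso solver can then be run on `stacked`; the spans recover which data type each selected feature came from.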

As a base case, the aforementioned lasso-based Screen
and Clean procedure is applied to each pair $(X^{(m)}, y_k)$
separately for $m \in$ {CNA, methylation, miRNA}, and for
each gene $k = 1, \ldots, 12042$. We denote the resulting coefficient
matrix by $\beta^{(m)}_{\mathrm{single}}$.

We validate the proposed integrative framework by
comparing the two association networks implied by
$\beta^{(m)}_{\mathrm{integrative}}$ and $\beta^{(m)}_{\mathrm{single}}$. Note that both the 'integrative'
approach and the 'single type'-based approach generate
pairs of genomic associations between a genomic feature
and a gene expression phenotype, with the strength of the
association given by the magnitude of the corresponding
regression coefficient.

Results 

Predictive power of the integrative feature is consistently
superior compared to that of each single type of
genomic features

One of the advantages of the sparse regression framework 
we adopt is that it is a predictive model and thus allows a 
quantitative performance evaluation. As a validation for 
the proposed integrative approach, we first compare the 
overall prediction accuracy of the integrative approach 
with those from each single genomic type based 
approaches using CNA, methylation, and miRNA data separately.
The average correlation coefficient between the
actual gene expression levels and the predicted ones across
samples is used as an accuracy measure. To examine the
trend in the overall predictive power of each feature type,
the genes are first partitioned into 10 equal-sized bins
according to (a) the average expression level μ across
samples, (b) the standard deviation of the expression levels
σ, and (c) the ratio of the two, μ/σ. Then we removed the
predicted associations with association strength smaller
than a threshold ρ of 0.1 to filter out less confident
associations in both approaches. The prediction accuracies on
the resulting genes are displayed in Figure 2 (A,B,C). The 
number of genes predicted to be in association with at 
least one feature is also displayed along the same deciles 
(Figure 2 D,E,F). 
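
The evaluation described above can be sketched as follows, assuming the accuracy of a gene is the Pearson correlation between its observed and predicted expression across samples. Whether the predictions are made on held-out samples is not stated in the text, so that detail is left open. The function names and toy numbers are hypothetical.

```python
from statistics import mean

def pearson(a, b):
    """Pearson correlation between two equally long sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / (va * vb) ** 0.5

def decile_bins(values, n_bins=10):
    """Sort gene indices by `values` and cut them into near-equal-sized bins."""
    order = sorted(range(len(values)), key=lambda g: values[g])
    n = len(order)
    return [order[n * b // n_bins: n * (b + 1) // n_bins] for b in range(n_bins)]

def mean_accuracy_per_bin(key_values, accuracies, n_bins=10):
    """Average per-gene accuracy within each bin of the binning key."""
    return [
        mean(accuracies[g] for g in bin_) if bin_ else None
        for bin_ in decile_bins(key_values, n_bins)
    ]

# Toy example: 20 genes binned by their mean expression level.
mean_expression = [float(g) for g in range(20)]
accuracy = [g / 20 for g in range(20)]
per_decile = mean_accuracy_per_bin(mean_expression, accuracy)
```

The same binning function can be reused with the standard deviation or the ratio μ/σ as the key, matching the three panels in the text.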
Figure 2 Comparison of overall predictive power. The average prediction accuracy of the proposed integrative method is compared with
those using each genomic feature type separately (A, B, C). Genes are first partitioned into 10 equal-sized bins according to their average
expression levels, standard deviation, and the ratio of the two, respectively, and then the result is displayed as a function of these deciles. The
number of genes in each bin having at least one genomic association is also displayed in D, E, F.



A number of interesting association patterns are clearly 
seen from Figure 2. First of all, the overall predictive 
power of the integrative feature is consistently superior 
to that from each single type of genomic features across 
the deciles. We also find that highly expressed genes
(with higher μ) tend to be explained better by copy number
alteration data, both before and after normalization
by standard deviation (Figure 2 A,C). In contrast, variably
expressed genes (with higher σ in Figure 2B, or more
apparently, with smaller μ/σ in Figure 2C) tend to be
explained better by the methylation features. This obser- 
vation confirms the expectation that methylation plays a 
more dynamic role in regulating the gene expression 
through a dynamic epigenetic mechanism. 

The number of genes having at least one genomic association
is slightly smaller in the case of the integrative framework
than in a single-type based approach using either
methylation or miRNA data only. Considering the superior
prediction accuracy of the integrative approach, we
conjecture that this is because the indirect or spurious
associations are effectively reduced in the integrative
approach.

Integrative approach is effective in reducing spurious 
associations 

We next compare the number of genes specifically asso- 
ciated with each genomic data type and that of genes 
synergistically affected by multiple types of features. 
Recall that in our integrative approach, the expression
level of a gene $k$ is represented as a linear combination of
CNA, miRNA, and methylation features using three coefficient
vectors $\beta^{\mathrm{CNA}}$, $\beta^{\mathrm{miRNA}}$, and $\beta^{\mathrm{methylation}}$. If at least
one element in the estimated $\beta^{(m)}$ is non-zero, we can say
the gene is associated with the genomic feature type $m$.
Therefore, a certain gene may be associated with more 
than one genomic feature type. 
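
The rule above (a gene is associated with type m if any coefficient in the corresponding vector is non-zero) and the resulting counts of type-specific versus synergistically affected genes can be sketched as below. Gene names and coefficients are made up for illustration.

```python
from collections import Counter

def associated_types(beta_by_type, threshold=0.0):
    """Return the set of feature types with at least one coefficient above threshold."""
    return frozenset(
        t for t, coefs in beta_by_type.items()
        if any(abs(b) > threshold for b in coefs)
    )

def venn_counts(genes):
    """Count genes by the exact combination of feature types they are associated with."""
    return Counter(associated_types(betas) for betas in genes.values())

# Hypothetical per-gene coefficient vectors from an integrative fit.
genes = {
    "gene_A": {"CNA": [0.0, 0.3], "methylation": [0.0], "miRNA": [0.0]},
    "gene_B": {"CNA": [0.2, 0.0], "methylation": [-0.5], "miRNA": [0.1]},
    "gene_C": {"CNA": [0.0, 0.0], "methylation": [0.0], "miRNA": [0.0]},
    "gene_D": {"CNA": [0.0, 0.0], "methylation": [0.4], "miRNA": [0.0]},
}
counts = venn_counts(genes)
cna_only = counts[frozenset({"CNA"})]
all_three = counts[frozenset({"CNA", "methylation", "miRNA"})]
```

Each key of `counts` corresponds to one region of a three-set Venn diagram, with the empty set collecting genes that have no association at all.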

The summary for the number of genes impacted by each 
genomic feature type is presented in Figure 3. First, the 
number of genes having at least one genomic association 
with each feature type is smaller in the case of the 
Figure 3 Comparison of the number of genes specifically impacted by single type of features and synergistically affected by multiple
types of data. The number of genes impacted by each type of genomic data is presented as a Venn diagram. The integrative approach
predicts a significantly smaller number of genes synergistically impacted by multiple types of data, and a greater number of genes that are
specifically impacted by only one genomic data type, in comparison with the single data type based analysis. The integrative approach produces
substantially fewer spurious associations in general, which can lead to more accurate characterization of genes based on the genomic
association.



integrative approach; for example, we find 3,082
CNA-associated genes in the integrative analysis versus 6,349
such genes in the single-type (CNA-only) analysis.
As already mentioned above regarding the predictive
accuracy, this appears to be because fewer
false positives are produced in the integrative
approach.

Furthermore, the single-type based approach estimated
a very large number of genes that are impacted synergistically
by all three feature types: 3,320 among the
total of 12,042 genes, which is more than half of the CNA-,
miRNA-, or methylation-associated genes. This seems to
be a clear over-estimation of such genes. In contrast,
the proposed integrative approach yields a substantially
smaller number and fraction of such synergistically
impacted genes (541 genes). As a result, a greater
number of genes fall into the class impacted specifically
by only one genomic type under our integrative framework
(e.g. 1,066 CNA-only-associated genes, versus 759
such genes in the single type-based analysis), even with a
smaller total number of genes in association. This highlights
the potential utility of our integrative approach
for characterizing genes based on their genomic associations
and for investigating the relative contribution
of different genomic feature types.

Heterogeneous genomic association network from the 
integrative approach has better modularity 

We provide a global outlook of the association networks 
estimated from both the integrative approach and the 
single-type based approach. Figure 4 shows the heterogeneous
genomic association networks in which features
from copy number alteration, methylation, miRNA or
gene expression data are represented as nodes and the
edges are constructed from the estimated non-zero
regression coefficients $\beta^{(m)}_{\mathrm{integrative}}$ and $\beta^{(m)}_{\mathrm{single}}$. For better
visualization, the network edges were further filtered
with a threshold ρ = 0.3 and nodes without any connected
edge were removed. The resulting networks reveal
very different global topologies such as the number of 
connected components or the clustering coefficients. 
Overall, the one from the proposed approach clearly has 
better modularity as illustrated in Figure 4, which may 
imply more functionally coherent network modules in it. 
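
A minimal sketch of the network construction described above: edges are kept when the coefficient magnitude reaches the threshold, nodes without edges drop out, and connected components are then extracted. Whether the original filter was inclusive or strict at the threshold is not stated, so the inclusive comparison here is an assumption. All node labels and coefficients are hypothetical.

```python
def build_network(edges, threshold):
    """Build an undirected adjacency map from (feature, gene, coefficient) triples.

    Only edges with |coefficient| >= threshold are kept, so nodes that lose
    all their edges never enter the map.
    """
    adjacency = {}
    for feature, gene, coef in edges:
        if abs(coef) >= threshold:
            adjacency.setdefault(feature, set()).add(gene)
            adjacency.setdefault(gene, set()).add(feature)
    return adjacency

def connected_components(adjacency):
    """Return the connected components as sets of nodes, largest first."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)

# Nodes are (type, name) pairs so that a methylation feature and the gene
# it sits next to remain distinct nodes even when they share a name.
edges = [
    (("CNA", "band_1"), ("gene", "g1"), 0.6),
    (("miRNA", "mir_1"), ("gene", "g2"), -0.4),
    (("miRNA", "mir_1"), ("gene", "g3"), 0.35),
    (("methylation", "g4"), ("gene", "g4"), 0.1),  # below threshold, dropped
]
network = build_network(edges, threshold=0.3)
components = connected_components(network)
```

The largest component, `components[0]`, is the gene set on which a functional enrichment test would then be run.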

We performed a functional enrichment test with respect
to GO Biological Process (BP) for the set of genes in the largest
connected component of each network. The enriched GO
BP terms with the smallest FDR-corrected p-values are largely
related to defense response and the immune system in
both approaches. For example, the most significant terms
were GO:6952 defense response in the integrative approach and
GO:9611 response to wounding in the single-type based approach,
with FDR-corrected p-values of 2.65e-23 and
1.13e-20, respectively. The integrative approach also
detected GO:42330 taxis and GO:6935 chemotaxis (FDR-corrected
p-values of 9.35e-12 and 9.35e-12, respectively)
as the 9th and 10th most significant terms, which are
known to be essential in cancer progression and metastasis.
In contrast, the single type-based analysis tends to produce
broader terms such as GO:48856 anatomical
structure development or GO:48731 system development
(4th and 5th, with FDR-corrected p-values of 3.53e-18 and
1.72e-17, respectively) other than the aforementioned
common terms.
Figure 4 Comparison of the heterogeneous association networks recovered from the integrative and individual approaches. The
association networks having heterogeneous types of genomic features and expression traits as nodes and their associations as edges are drawn
using Cytoscape. Overall, the one from the proposed approach reveals enhanced network modularity.



Relative contribution of each genomic feature type to 
gene expression phenotypes 

We now perform more focused analysis on the associa- 
tion network estimated under the proposed framework. 
In Figure 5, the relative contribution of each genomic 
feature type to gene expression traits is characterized by 
looking at an increasing number of detected genomic 
associations. Specifically, the proportion of genes in association
with each genomic feature type, calculated using
the top K strongest genomic effects, is shown for K =
100, 200, 400, 800, 1600, 3200, and 6400.
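
The top-K summary can be sketched as below. The sketch assumes that "strongest" means largest absolute regression coefficient and that the reported proportion is the share of the top K associations contributed by each feature type; both are readings of the text rather than confirmed details. The association list is made up.

```python
from collections import Counter

def type_share_in_top_k(associations, k):
    """Share of each feature type among the k strongest associations.

    `associations` holds (feature_type, feature, gene, coefficient) tuples.
    """
    top = sorted(associations, key=lambda a: abs(a[3]), reverse=True)[:k]
    if not top:
        return {}
    counts = Counter(a[0] for a in top)
    return {feature_type: n / len(top) for feature_type, n in counts.items()}

# Hypothetical associations; names and strengths are illustrative only.
associations = [
    ("methylation", "m1", "g1", 0.9),
    ("miRNA", "r1", "g2", -0.8),
    ("methylation", "m2", "g3", 0.7),
    ("CNA", "c1", "g4", 0.6),
    ("CNA", "c2", "g5", -0.5),
    ("miRNA", "r2", "g6", 0.1),
]
shares = {k: type_share_in_top_k(associations, k) for k in (2, 4, 6)}
```

Evaluating the function over an increasing sequence of K values gives one curve per feature type, which is the kind of trend the following paragraph discusses.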

First, the effect of methylation features was the most
dominant overall. Among the top 100 genomic associations,
only one involved a copy number alteration
feature (CCNE1 expression and copy
number alteration at 19q12), 24 involved miRNAs, and
the remaining 75 involved methylation features. We
noticed that among the 75 methylation-mediated associations,
29 were trivial associations of a gene with the
methylation near the same gene. This may have led to
an overestimate of the proportion of methylation-affected
genes when only a small number of top signals is considered.
As we increase the number of top genomic effects K, the 
proportion of CNA-associated genes steadily increases 
from 0.01 to 0.27 while that of methylation-associated 
genes decreases from 0.75 to 0.5 until K = 3200 and 
remains similar after that. The proportion of miRNA- 
associated genes remains similar across K, ranging from 
0.22 to 0.28. While methylation seems to contribute the 
most to gene expression variation in ovarian cancer, the 
impacts of CNA and miRNA are also surprisingly signifi- 
cant considering the substantially smaller number of 
used features (54 and 799, respectively) than that of 
methylation features (6913). 

Hub genomic features and the functional characterization 
of the co-affected genes 

We further zoom into the association network and exam- 
ine the top 10 hub genomic features impacting the largest 
number of genes. The hub genomic features and the 
Figure 5 [...] considering the smaller number of used features.



most significant GO terms and SP-PIR (Swiss-Prot and
Protein Information Resource) keywords for the set of
associated genes are listed in Table 2. For instance,
miRNA-768-5p, which has previously been shown to play an
important role in ovarian cancer progression [39], has
the greatest number of neighbours, 303 genes. miRNA-768-5p
was also identified among the microRNA signatures of
tumor-derived exosomes proposed as diagnostic biomarkers of
ovarian cancer [40]. The functional enrichment test using
DAVID shows that the genes affected by miRNA-768-5p
are most significantly enriched with the SP-PIR keyword
acetylation (FDR-corrected p-value = 1.4e-19). In addition,
miRNA-29a is a well-known oncosuppressor
miRNA, which is frequently lost or down-regulated in
cancer so that target oncoproteins like CDK6, MCL1, or
BCL-2 can be upregulated [41]. Among the top 10 hub
features, 7 were copy number alteration features,
supporting the significance of copy number alteration
events in cancer progression and treatment. CNA features
affect several genes associated with acetylation, phosphoprotein,
or nucleus. Methylation of Sprouty-4 (SPRY4),
an inhibitor of the receptor-transduced mitogen-activated
protein kinase (MAPK) signalling pathway, has
been detected in prostate cancer [42]. Methylation at
SPRY4 affects 119 downstream genes, which are also
related to acetylation.

Discussion and conclusion 

We proposed to elucidate the gene expression pheno- 
types with multiple different types of genomic features 
together to gain better insight on the global genomic 
mechanism underlying cancers. Through the analysis of 
TCGA ovarian cancer dataset, we validated the proposed 
integrative framework in various aspects. The proposed 
approach provided a systematic view on the relative con- 
tribution of different types of genomic data on the 
expression of genes. Since different levels of genomic
data might affect gene regulation in either a partly
independent or a partly complementary fashion, the proposed
framework, which incorporates all these different genomic
features as potential regulators of gene expression, will
lead us to an enhanced global view on interplays between
Table 2 Top 10 hub genomic features impacting the largest number of genes.

Genomic feature  Type         N    Most significant GO term                                 p-value (FDR)        Most significant SP-PIR keyword  p-value (FDR)
hsa.miR.768.5p   miRNA        303  GO:0005739 mitochondrion                                 8.67E-09 (1.18E-05)  acetylation                      1.10E-22 (1.40E-19)
hsa.miR.29a      miRNA        154  GO:0007049 cell cycle                                    7.99E-20 (1.21E-16)  cell cycle                       7.30E-17 (1.44E-13)
SPRY4            methylation  119  GO:0003723 RNA binding                                   3.19E-4 (0.4197)     acetylation                      1.09E-6 (0.0013)
16p13.3          CNA          117  GO:0031974 membrane-enclosed lumen                       4.10E-5 (0.0506)     iron-sulfur                      9.98E-5 (0.1257)
12q23.1          CNA          113  GO:0070013 intracellular organelle lumen                 4.29E-5 (0.0534)     acetylation                      2.62E-6 (0.0032)
1q42.3           CNA          113  GO:0012505 endomembrane system                           7.3E-4 (0.8908)      phosphoprotein                   4.71E-6 (0.0059)
6p21.1           CNA          109  GO:0042974 retinoic acid receptor binding                4.86E-8 (6.38E-5)    nucleus                          1.84E-5 (0.0228)
17q25.3          CNA          108  non-membrane-bounded organelle                           0.0092 (11.23)       acetylation                      6.39E-9 (8.03E-6)
1p36.11          CNA          103  GO:0016071 mRNA metabolic process                        0.0046 (6.86)        phosphoprotein                   1.31E-4 (0.1632)
19p13.12         CNA          102  GO:0043232 intracellular non-membrane-bounded organelle  1.24E-5 (0.0150)     nucleus                          1.24E-5 (0.0149)

The most significantly enriched GO terms and SP-PIR keywords for the set of associated genes are listed together.



them. Understanding the molecular pathogenesis and 
underlying complex mechanisms in ovarian cancer 
through the global view on interplays between them is 
expected to provide guidance for improved prognostic 
indicators and effective therapies [33].

The proposed approach may be regarded as an integra- 
tive eQTL approach in which not only the genetic varia- 
tions of SNPs but also other features in both genomic and 
epigenomic levels are used to explain the expression of 
genes. The original purpose of eQTL is to search for genomic
variations that can explain the variance of gene
expression as an intermediate phenotype. Thus, it can be
conceptually extended to an integrative approach with other
levels of genomic features in order to better explain gene
expression at the phenotype level. Since TCGA does not
provide SNP data publicly, we excluded SNPs from our
analysis. Thus, we used copy number data as the
genome-level feature in this study. However, integration with SNP
data will provide opportunities to investigate the genetic 
associations as well as the epigenetic associations in a 
principled way. 

One limitation of the proposed approach is the parametric
assumption of a normal distribution for genomic
features, which is not valid in general. We leave the
investigation of deviations from this parametric
assumption, and of possible improvements, as future
work. Another interesting direction for further research
would be the integration with existing biological knowledge.
Systematic schemes for the choice, representation,
and incorporation of such knowledge bases remain part of our
further research plan.



We used the ovarian cancer dataset, which is one of the
datasets in the first phase of the TCGA project, as a pilot
task for the study. However, TCGA has been generating
additional cancer genomic data for about 25 tumor types
in the second phase of the project, mainly sequencing-based
data. Since our proposed method is flexible enough to use
any kind of multi-omics data, it can readily be extended to
other cancer types in future work.